### Library Imports
```python
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal
```
### Template
```python
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.1 - Looking at Your Data")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path

pets = spark.read.csv(path, header=True)
pets.toPandas()
```
|   | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | 
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | 
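
Note that `spark.read.csv` reads every column as a string by default, which is why the `Row` outputs below show values like `u'5'` for `age`. A minimal sketch, if you'd rather have Spark guess the column types:

```python
# By default every CSV column is read as a string. inferSchema asks
# Spark to sample the file and guess types, at the cost of an extra
# pass over the data.
pets_typed = spark.read.csv(path, header=True, inferSchema=True)
pets_typed.printSchema()
```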
## Looking at Your Data
Spark is lazily evaluated: transformations are not computed until an action forces them. To look at your data, you must therefore perform a take operation, which triggers all of your queued transformations to be evaluated. There are a few ways to perform a take operation, and we'll go through them here along with their performance characteristics.

For example, `toPandas()` is a take operation, which you've already seen in many places.
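
As a minimal sketch of that laziness (reusing the pets data above; the age filter is just an arbitrary example):

```python
# A transformation: nothing is read or computed yet.
old_pets = pets.where(F.col("age").cast("int") > 9)

# An action: only now does Spark read the file and apply the filter.
old_pets.collect()  # rows for Argus and Chewie
```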
### Option 1 - collect()
```python
pets.collect()
```
```
[Row(id=u'1', breed_id=u'1', nickname=u'King', birthday=u'2014-11-22 12:30:31', age=u'5', color=u'brown'),
 Row(id=u'2', breed_id=u'3', nickname=u'Argus', birthday=u'2016-11-22 10:05:10', age=u'10', color=None),
 Row(id=u'3', breed_id=u'1', nickname=u'Chewie', birthday=u'2016-11-22 10:05:10', age=u'15', color=None)]
```
#### What Happened?
When you call `collect()` on a dataframe, it triggers a take operation that brings all of the data to the driver node and returns every row as a list of `Row` objects.
#### Note
This is not advised unless you genuinely need to look at all the rows of your dataset; you should usually sample a subset of the data instead. This call will execute all of the transformations that you have specified, on all of the data. A couple of ways of cutting the data down first are sketched below.
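
For example (a sketch; the fraction and seed are arbitrary):

```python
# Cap the number of rows brought back to the driver.
pets.limit(2).collect()

# Or collect a random ~50% sample of the rows; the exact number
# of rows returned will vary.
pets.sample(False, 0.5, 42).collect()
```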
### Option 2 - head()/take()/first()
```python
pets.head(n=1)
```
```
[Row(id=u'1', breed_id=u'1', nickname=u'King', birthday=u'2014-11-22 12:30:31', age=u'5', color=u'brown')]
```
#### What Happened?
When you call `head(n)` on a dataframe, it triggers a take operation and returns the first `n` rows of the resulting dataset. The three functions differ in how many rows they return and in what shape, as sketched below.
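
A quick sketch of how their return shapes differ:

```python
pets.take(2)   # list of the first 2 Row objects
pets.head(2)   # same as take(2)
pets.head()    # a single Row (not a list) when n is omitted
pets.first()   # a single Row, same as head()
```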
#### Note
- If the data is unsorted, Spark will perform the transformations on only as many partitions as are needed to satisfy the requested number of rows. This is much more performant than `collect()`, particularly for large datasets.
- If the data is sorted, Spark will behave the same as `collect()` and perform all of the transformations on all of the data.

By sorted we mean that some kind of "sorting of the data" happens during the transformations, such as `sort()`, `orderBy()`, etc. A minimal sketch of the difference follows.
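
Both calls below return a single row, but the second must look at everything first:

```python
# Unsorted: Spark can stop as soon as it has found 1 row,
# possibly after scanning only a single partition.
pets.head(n=1)

# Sorted: the orderBy forces Spark to look at all of the data
# before it can know which row comes first.
pets.orderBy("age").head(n=1)
```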
### Option 3 - toPandas()
```python
pets.toPandas()
```
|   | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | 
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | 
#### What Happened?
When you call `toPandas()` on a dataframe, it triggers a take operation and returns all of the rows as a pandas `DataFrame` on the driver.

This is only as performant as the `collect()` function, but it is the most readable of the options, in my opinion.
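
Because all of the rows end up in driver memory, a common pattern is to cut the dataframe down before converting; a sketch with an arbitrary cap of 2 rows:

```python
# Convert only a capped number of rows; the resulting pandas
# DataFrame lives entirely in driver memory.
pets_pdf = pets.limit(2).toPandas()
print(pets_pdf.shape)  # (2, 6)
```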
### Option 4 - show()
```python
pets.show()
```
```
+---+--------+--------+-------------------+---+-----+
| id|breed_id|nickname|           birthday|age|color|
+---+--------+--------+-------------------+---+-----+
|  1|       1|    King|2014-11-22 12:30:31|  5|brown|
|  2|       3|   Argus|2016-11-22 10:05:10| 10| null|
|  3|       1|  Chewie|2016-11-22 10:05:10| 15| null|
+---+--------+--------+-------------------+---+-----+
```
#### What Happened?
When you call `show()` on a dataframe, it triggers a take operation and prints up to 20 rows by default.

This is as performant as the `head()` function and more readable, since the result is rendered as a table. (I still prefer `toPandas()` 😀.) A couple of its useful arguments are sketched below.
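
A sketch of a couple of its useful arguments (`vertical=` requires Spark 2.3+):

```python
# Print only 2 rows, and don't truncate long cell values.
pets.show(n=2, truncate=False)

# One column per line; easier to read for very wide tables
# (Spark 2.3+).
pets.show(n=1, vertical=True)
```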
## Summary
- We learnt about various functions that allow you to look at your data.
- Some functions are less performant than others, depending on whether the resultant data is sorted.
- Try to refrain from looking at all of the data unless you really need to.